Tricks of the Mac Game Programming Gurus

home *** CD-ROM | disk | FTP | other *** search

/ Tricks of the Mac Game Programming Gurus / TricksOfTheMacGameProgrammingGurus.iso / Information / CSMP Digest / volume 1 / csmp-v1-198.txt < prev next >

Wrap

Text File | 1994-12-08 | 47.2 KB | 1,298 lines | [TEXT/R*ch]

C.S.M.P. Digest Wed, 28 Oct 92 Volume 1 : Issue 198 Today's Topics: Better random number than Random()? BlockZero.c snippet The Comp.Sys.Mac.Programmer Digest is moderated by Michael A. Kelly. The digest is a collection of article threads from the internet newsgroup comp.sys.mac.programmer. It is designed for people who read c.s.m.p. semi- regularly and want an archive of the discussions. If you don't know what a newsgroup is, you probably don't have access to it. Ask your systems administrator(s) for details. (This means you can't post questions to the digest.) Each issue of the digest contains one or more sets of articles (called threads), with each set corresponding to a 'discussion' of a particular subject. The articles are not edited; all articles included in this digest are in their original posted form (as received by our news server at cs.uoregon.edu). Article threads are not added to the digest until the last article added to the thread is at least one month old (this is to ensure that the thread is dead before adding it to the digest). Article threads that consist of only one message are generally not included in the digest. The entire digest is available for anonymous ftp from ftp.cs.uoregon.edu [128.223.8.8] in the directory /pub/mac/csmp-digest. Be sure to read the file /pub/mac/csmp-digest/README before downloading any files. The most recent issues are available from sumex-aim.stanford.edu [36.44.0.6] in the directory /info-mac/digest/csmp. If you don't have ftp capability, the sumex archive has a mail server; send a message with the text '$MACarch help' (no quotes) to LISTSERV@ricevm1.rice.edu for more information. The digest is also available via email. Just send a note saying that you want to be on the digest mailing list to mkelly@cs.uoregon.edu, and you will automatically receive each new issue as it is created. Sorry, back issues are not available through the mailing list. Send administrative mail to mkelly@cs.uoregon.edu. ------------------------------------------------------- From: hd12@ellis.uchicago.edu (hui dong) Subject: Better random number than Random()? Date: 18 Sep 92 10:54:10 GMT Organization: University of Chicago Computing Organizations (I did check FAQ before I post this.) I need a fast (speed is very important) random number generator. Right now I am using Random(), but it seems to me the distribution is not uniform at all. Is there a better generator? Will it be faster to use my own function than to use toolbox routine? Thanks for the help. +++++++++++++++++++++++++++ From: fry@tara.harvard.edu (David Fry) Date: 19 Sep 92 05:20:38 GMT Organization: Harvard Math Department In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes: >(I did check FAQ before I post this.) > >I need a fast (speed is very important) random number generator. Right now >I am using Random(), but it seems to me the distribution is not uniform >at all. Is there a better generator? Will it be faster to use my own >function than to use toolbox routine? >Thanks for the help. This probably should be in the FAQ list because it comes up all the time (and it seems I'm always the one to answer it). Random() is a quite good random number generator, and it is pretty fast. It will almost certainly be faster than one you write yourself, provided the one you write yourself is as good. Are you changing the randSeed seed variable before calling it the first time? You can use the result of a GetDateTime() call as the initial seed. Why do you think it isn't uniform? Are you doing anything strange with the results of Random(), like using a "mod" to get a smaller random number? This decreases the quality of your random numbers, although not as much with Random() as with some other RNG's. David Fry fry@math.harvard.edu Division of Applied Sciences fry@huma1.bitnet Harvard University ...!harvard!huma1!fry Cambridge, MA 02138 +++++++++++++++++++++++++++ From: jvp@vlsi1.dell.com (Jim Van Peursem) Date: 19 Sep 92 15:56:43 GMT Organization: Iowa State University, Ames IA In <1992Sep19.012039.15704@husc3.harvard.edu> fry@tara.harvard.edu (David Fry) writes: >>I need a fast (speed is very important) random number generator. Right now >>I am using Random(), but it seems to me the distribution is not uniform >>at all. Is there a better generator? Will it be faster to use my own >>function than to use toolbox routine? >>Thanks for the help. >This probably should be in the FAQ list because it comes up all the >time (and it seems I'm always the one to answer it). Random() is a >quite good random number generator, and it is pretty fast. It will >almost certainly be faster than one you write yourself, provided the >one you write yourself is as good. What kind of RNG is Random()? Isn't it a linear congruential? If it is, better quality RNG's exist such as Lagged Fibbonacci. +---------------------------------------------------------------+ | Jim Van Peursem - Ph.D. Student - Ham Radio -> KE0PH | | System Administrator - Digital Systems Design Lab | | Department of Electrical Engineering and Computer Engineering | | Iowa State University - Ames, IA 50011 | | internet - jvp@cpre1.ee.iastate.edu | +---------------------------------------------------------------+ +++++++++++++++++++++++++++ From: sw@network-analysis-ltd.co.uk (Sak Wathanasin) Date: 19 Sep 92 08:45:28 GMT Organization: Network Analysis Ltd In article <1992Sep18.105410.24438@midway.uchicago.edu> (comp.sys.mac.programmer), hd12@ellis.uchicago.edu (hui dong) writes: > (I did check FAQ before I post this.) Well, it ought to be in the FAQ, since this comes up every few months. It is covered in the Usenet Mac Programmer's Guide. > I need a fast (speed is very important) random number generator. Right now > I am using Random(), but it seems to me the distribution is not uniform > at all. Is there a better generator? Will it be faster to use my own > function than to use toolbox routine? Ignore the 16-bit result returned by Random(), and use the full 32-bit value that is in randSeed. Details are in the UMPG. The Random() in rom is the same as the Park & Miller "minimal standard RNG" described in a CACM paper, and guards against overflows during intermediate calculations. You will be hard pushed to write a faster version in C or Pascal, and will have to use assembler to beat it. Sak Wathanasin Network Analysis Limited 178 Wainbody Ave South, Coventry CV3 6BX, UK uucp: ...!uknet!nan!sw Phone: (+44) 203 419996 AppleLink: NAN.LTD Internet: sw@network-analysis-ltd.co.uk +++++++++++++++++++++++++++ From: fry@tara.harvard.edu (David Fry) Date: 19 Sep 92 13:42:12 EDT Organization: Harvard Math Department In article <jvp.716918203@vlsi1> jvp@vlsi1.dell.com (Jim Van Peursem) writes: > What kind of RNG is Random()? Isn't it a linear congruential? >If it is, better quality RNG's exist such as Lagged Fibbonacci. Yes, it's linear congruential. Certainly better RNG's exist, but for more post purposes, Random() is more than good enough. It follows the so-called "minimal standard" RNG proposed in "Random Number Generators: Good Ones are Hard to Find" by Park and Miller, in Communications of the ACM, October 1988. This article details the history of faulty RNGs and explains why this is a fast and good one. As someone else pointed out, to really get the best quality from Random(), you should ignore there result of the function (which is 16-bits), and use the value in randSeed (which is 32-bits) after calling Random(). In that case you have exactly the RGN described in the CACM paper. I think even using the 16-bit number is good enough for most purposes because the lower 16-bits of the RNG are sufficiently random. Still, I admit to using the full 32-bits in my work :-). David Fry fry@math.harvard.edu Division of Applied Sciences fry@huma1.bitnet Harvard University ...!harvard!huma1!fry Cambridge, MA 02138 +++++++++++++++++++++++++++ From: Bruce.Hoult@bbs.actrix.gen.nz Organization: Actrix Information Exchange Date: Sun, 20 Sep 1992 04:06:49 GMT In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes: > (I did check FAQ before I post this.) > > I need a fast (speed is very important) random number generator. Right now > I am using Random(), but it seems to me the distribution is not uniform > at all. Is there a better generator? Will it be faster to use my own > function than to use toolbox routine? > Thanks for the help. The toolbox Random() functions is perfectly OK if you use the *seed* as your random value and ignore the function result. OTOH, it's not exactly fast because of the Trap call overhead. I've included my own random number generator, which requires a 68020 or better. I've also included my attempt at a random generator that produces "normally" distributed numbers. It's based on the Central Limits Theorem that says the sum of several independant random variables is approximately normally distributed, no matter what distribution the individual variables have. I found that for my purposes the sum of ten uniformally distributed variables was good enough. The only mildly clever thing about it is that I do a little parallel computing by working out the random numbers in the CPU and then adding them (to get a greater then 32 bit result) in the coprocessor. Comment is invited on whether this is a good way to generate a normal distribution, or not. - -- Bruce - --------------- procedure MyRandom(var seed:longint); inline $225F, {movea.l (a7)+,a1} $2011, {move.l (a1),d0} $4C3C, $0401, $0000, $41A7, {mulu.l #41a7,d1:d0} $4C7C, $0401, $7FFF, $FFFF, {divu.l #7fffffff,d1:d0} $2281; {move.l d1,(a1)} procedure MyNormal( numToAdd:integer; var seed:longint; var result:extended ); external; - ------------------------------------------------------ machine mc68020 mc68881 MyNormal proc export link a6,#0 fmove.w #0,fp0 move.l 8(a6),a1 ;pointer to result move.l 12(a6),a0 ;pointer to seed move.l (a0),d0 move.w 16(a6),d2 ;number of loops bra.b loopend loop mulu.l #16807,d1:d0 ;7^5 divu.l #$7fffffff,d1:d0 fadd.l d1,fp0 move.l d1,d0 loopend dbra d2,loop move.l d0,(a0) ;update the seed fmove.x fp0,(a1) ;return the result unlk a6 move.l (a7)+,a0 add.l #10,a7 jmp (a0) endproc end - -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB hard disk that fits in your pocket!" "Great! Is it PC compatible?" +++++++++++++++++++++++++++ From: ksand@apple.com (Kent Sandvik) Date: 20 Sep 92 19:06:23 GMT Organization: Apple In article <1992Sep19.012039.15704@husc3.harvard.edu>, fry@tara.harvard.edu (David Fry) wrote: > In article <1992Sep18.105410.24438@midway.uchicago.edu> hd12@midway.uchicago.edu writes: > >I need a fast (speed is very important) random number generator. Right now > >I am using Random(), but it seems to me the distribution is not uniform > >at all. Is there a better generator? Will it be faster to use my own > >function than to use toolbox routine? > >Thanks for the help. > This probably should be in the FAQ list because it comes up all the > time (and it seems I'm always the one to answer it). Random() is a > quite good random number generator, and it is pretty fast. It will > almost certainly be faster than one you write yourself, provided the > one you write yourself is as good. True, at least the Usenet Mac Programmer Guide has a lot of information about alternate random generator algorithms and code. I've done some tests myself, and it wasn't really justified to write alternate random generators, for instance I didn't see much performance improvements by inlining hex code instead of calling a trap. Check the TRandom C++ class on the developer CD for my silly tests. Cheers, Kent/DTS "You should really just relax" -MST3K - ------------------- Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com) DISCLAIMER: Private activities on the Net. +++++++++++++++++++++++++++ From: Bruce.Hoult@bbs.actrix.gen.nz Date: 21 Sep 92 01:34:32 GMT Organization: Actrix Information Exchange In article <ksand-200992120624@wintermute.apple.com> ksand@apple.com (Kent Sandvik) writes: > I've done some > tests myself, and it wasn't really justified to write alternate random > generators, for instance I didn't see much performance improvements > by inlining hex code instead of calling a trap. Check the TRandom C++ > class on the developer CD for my silly tests. What machine were you testing on? Whn I wrote the code I posted last night I was on a IIcx and there was a *massive* difference in trap overhead vs procedure call overhead. The difference seems to have gotten much smaller on the '040 -- in a test a while ago on the Q700 I found that a function call and return with three longword parameters and a trivial body (like SetPt) took just over 1 uS while a similar trap call took 3 uS. (I hope i've remembered those right :-) - -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB hard disk that fits in your pocket!" "Great! Is it PC compatible?" +++++++++++++++++++++++++++ From: ksand@apple.com (Kent Sandvik) Date: 22 Sep 92 02:02:51 GMT Organization: Apple In article <1992Sep21.013432.18668@actrix.gen.nz>, Bruce.Hoult@bbs.actrix.gen.nz wrote: > Whn I wrote the code I posted last night I was on a IIcx and there was > a *massive* difference in trap overhead vs procedure call overhead. > The difference seems to have gotten much smaller on the '040 -- in a > test a while ago on the Q700 I found that a function call and return > with three longword parameters and a trivial body (like SetPt) took just > over 1 uS while a similar trap call took 3 uS. (I hope i've > remembered those right :-) When I think about it my tests were on a Quadra 900. I will do some more additional tests on other machines later. Kent "You should really just relax" -MST3K - ------------------- Kent Sandvik (UUCP: ....!apple!ksand; INTERNET: ksand@apple.com) DISCLAIMER: Private activities on the Net. --------------------------- From: mgleason@cse.unl.edu (Mike Gleason) Subject: BlockZero.c snippet Date: 15 Sep 92 18:50:19 GMT Organization: NCEMRSoft I think J. McCarthy & M. Mora started a good idea by posting code snippets here, so here's a freebie function that clears an area of memory (which can start on an odd boundary). It's much faster than using memset(ptr,0,n) because this function uses move.l's for the majority of the block. It compiles under Think C, but the asm {} blocks may break other compilers. I use this function all the time, especially when dealing with parameter blocks. e.g.: #define ZERO(a) BlockZero(&(a), sizeof (a)) { HParamBlockRec pb; ZERO(pb); } Code follows. Further optimizations, your cool code snippets, bug fixes <gulp> welcome.... - ---------------- void BlockZero(void *area, register unsigned long n); void BlockZero(void *area, register unsigned long n) { register char *p = (char *)area; register unsigned long fours; register unsigned long remainder, zero; if (n) { if ((long) p & 01) { /* Ptr on an odd boundary... */ asm { clr.b (p)+ } --n; /* * Ptr is now on an even boundary, * and it's first byte is zeroed. */ } fours = n / 4L; /* Can be zero. */ /* Do the meat of the block using longwords for speed. */ asm { clr.l zero bra @2 @1: move.l zero, (p)+ @2: subq.l #1, fours tst.l fours bge @1 } /* * If the size of the block isn't divisible by 4, we'll * have to do the remaining bytes (up to 3) by hand. */ remainder = n & 03; /* same as remainder = n % 4. */ asm { bra @4 @3: clr.b (p)+ @4: dbra remainder, @3 } } } /* BlockZero */ /* eof BlockZero.c */ - -- - --mg mgleason@cse.unl.edu +++++++++++++++++++++++++++ From: keith@taligent.com (Keith Rollin) Date: 16 Sep 92 17:55:14 GMT Organization: Taligent In article <1992Sep15.185019.22483@unlinfo.unl.edu>, mgleason@cse.unl.edu (Mike Gleason) wrote: > > I think J. McCarthy & M. Mora started a good idea by posting code snippets here, > so here's a freebie function that clears an area of memory (which can start > on an odd boundary). It's much faster than using memset(ptr,0,n) because > this function uses move.l's for the majority of the block. It compiles > under Think C, but the asm {} blocks may break other compilers. > > [ code deleted ] > True, not only will this not work under MPW (which doesn't have inline asm), but MPW's memset library routine is better than the one you post. Specifically, for large enough blocks, it uses a series of 4 MOVE.L instructions in a row for setting the block. On the other hand, I don't know what the timing differences between MOVE.L and CLR.L are, so perhaps a specialized bzero() based on MPW's algorithm would be nice. - ----- Keith Rollin Phantom Programmer Taligent, Inc. +++++++++++++++++++++++++++ From: nagel@Cigna.COM (Mark Nagel) Date: 16 Sep 92 23:09:16 GMT Organization: CIGNA FIRST, Irvine, CA In <keith-160992104732@kip-58.taligent.com> keith@taligent.com (Keith Rollin) writes: >instructions in a row for setting the block. On the other hand, I don't >know what the timing differences between MOVE.L and CLR.L are, so perhaps a >specialized bzero() based on MPW's algorithm would be nice. I'm curious about this. I picked up a Motorola 68K family processor manual, but it doesn't have instruction timing info. I have noticed that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L D0. I gather from this that MOVEQ.L is faster, but I'd still like to see timings. Anyone know a good assembler book for the 68K that has info like that? Something along the lines of "Zen and the Art of Assembly Language Programming" would be wonderful. Mark - -- Mark Nagel <nagel@cigna.com> | DISCLAIMER: Any resemblence of these Network Administrator | opinions to those of CIGNA are purely CIGNA/FIRST | coincidental. (Don't start vast projects with half-vast ideas) +++++++++++++++++++++++++++ From: gurgle@netcom.com (Pete Gontier) Date: 17 Sep 92 02:47:01 GMT Organization: cellular nagel@Cigna.COM (Mark Nagel) writes: >I'm curious about this. I picked up a Motorola 68K family processor >manual, but it doesn't have instruction timing info. I have noticed >that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L >D0. I gather from this that MOVEQ.L is faster, but I'd still like >to see timings. Notice, though, that MOVEQ has very limited destination operand types. If I remember correctly, registers only. So that renders its use in the context of bzero( ) pretty dubious. Me, I just call NewHandleClear in the first place and assume Apple's Cray can write better 68000 than me. :-) - -- Pete Gontier // EC Technology // gurgle@netcom.com +++++++++++++++++++++++++++ From: Bruce.Hoult@bbs.actrix.gen.nz Date: 17 Sep 92 05:42:50 GMT Organization: Actrix Information Exchange In article <1992Sep15.185019.22483@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes: > I think J. McCarthy & M. Mora started a good idea by posting code snippets here, > so here's a freebie function that clears an area of memory (which can start > on an odd boundary). It's much faster than using memset(ptr,0,n) because > this function uses move.l's for the majority of the block. It compiles > under Think C, but the asm {} blocks may break other compilers. You should add in a possible 2 byte move at the start and end to align the pointer on a longword boundary. This will increase speed on the 32-bit bus machines. You should unroll the main loop to do several move.l's each time around. You should use dbra for the loop control (only two bytes longer than your version and much faster)... bra @3 @1 swap fours @2 move.l zero,(p)+ # or several of them.. @3 dbra fours, @2 swap fours dbra fours, @1 - -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB hard disk that fits in your pocket!" "Great! Is it PC compatible?" +++++++++++++++++++++++++++ From: Bruce.Hoult@bbs.actrix.gen.nz Date: 17 Sep 92 10:08:16 GMT Organization: Actrix Information Exchange In article <1992Sep16.230916.18166@Cigna.COM> nagel@Cigna.COM (Mark Nagel) writes: > I'm curious about this. I picked up a Motorola 68K family processor > manual, but it doesn't have instruction timing info. I have noticed > that THINK C uses MOVEQ.L #0, D0, for example, rather than CLR.L > D0. I gather from this that MOVEQ.L is faster, but I'd still like > to see timings. Anyone know a good assembler book for the 68K that > has info like that? Something along the lines of "Zen and the Art > of Assembly Language Programming" would be wonderful. My official Motorola 68000 (M68000UM/AD REV 4 1986) and 68020 (MC68020UM/AD REV 1 1985) books both have tables of timing information, but the 68020 one is full of disclaimers about how complex figuring timings is on modern architectures, and in fact I think motorola quit publishing instruction timing tables. You've really got to try *your* instruction sequence to find out, unless you've got really intimate knowledge of the undocumented internals of the pipeline etc. - -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB hard disk that fits in your pocket!" "Great! Is it PC compatible?" +++++++++++++++++++++++++++ From: mgleason@cse.unl.edu (Mike Gleason) Date: 17 Sep 92 18:29:11 GMT Organization: NCEMRSoft Bruce.Hoult@bbs.actrix.gen.nz writes: >You should unroll the main loop to do several move.l's each time around. >You should use dbra for the loop control (only two bytes longer than your >version and much faster)... I tried this for the main block, but if I recall, dbra's loop counter was acting like a 16-bit word instead of a nice big longword. Some other folks have sent me some tips, so I will tweak it more using these tricks that aren't covered in my VAX Assembler book. I have attempted to get my hands on a 68000 assembly book for the mac, but so far all the books mentioned in the FAQ (and the one E. Wylie suggested recently) are either out of print or my book store can't order them :-( - -- - --mg mgleason@cse.unl.edu +++++++++++++++++++++++++++ From: mgleason@cse.unl.edu (Mike Gleason) Date: 18 Sep 92 05:27:49 GMT Organization: NCEMRSoft Here's a much improved BlockZero function. This one now long word aligns, and instead of using one small move.l loop to clear 4 bytes at a time, this version clears 144 bytes per loop iteration, then 16 bytes at a time when the block has less than 144 dirty bytes left, then 1 byte at a time for any remaining bytes. In other words, it performs OK on small blocks, pretty good on medium size blocks, and excellent on large blocks. Of course there aren't many times where one would need to zero a big block, since most big blocks are pointers that can be cleared the same time they are allocated, but it still makes a good general-purpose bzero. Thanks to everyone who responded, especially John Bruner and Michael Hecht. I'm sure there are more optimizations that could be done, but I'll let the Gods of the 68000 worry about that :-) Here it is, it requires Think C to compile. /* BlockZero.c */ /* Handy macro: */ #define ZERO(block,size) BlockZero(&(block), (long) (size)) void BlockZero(void *area, long n); #define PTR a0 /* Our copy of 'area.' */ #define COUNT d0 /* Our copy of 'n.' */ /* * Using registers d1 - d7, and a1 - a5 to hold zeroes for use * with MOVEM. Could use a6 also if you can trick the inline * assembler to not use LINK/UNLK. */ #define nAddrRegsUsed 5 #define nDataRegsUsed 7 /* * One MOVEM instruction will clear this many bytes at a time. */ #define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long)) /* * This is how many MOVEM instructions you want in your loop. For each * extra one you add, then large blocks will clear faster; however * smaller blocks will be slower because it will use the intermediate * loop, where only 16 bytes are cleared at a time. */ #define UnrolledLoops 3 /* * For each UnrolledLoop, do one of these. */ #define Loop movem.l d1-d7/a1-a5, -(PTR) /* * This is how many bytes are cleared by all of your MOVEMs before * we have to decrement the count and start another loop iteration. */ #define TotalSize (ZSize * UnrolledLoops) void BlockZero(void *area, long n) { asm { movem.l d0-d7/a0-a5, -(sp) move.l n, COUNT ; Could use index from the SP movea.l area, PTR ; for these instead of a6. add.l COUNT, PTR ; Start at end of block, ; and go backwards. move.w PTR, d1 ; Find (area + n) % 4. andi.w #3, d1 sub.l d1, COUNT ; We will byte-zero this ; many bytes to longword align, ; so subtract this from the ; total. moveq.l #0, d2 bra @2 ; Zero the last 3 or so @1: move.b d2, -(PTR) ; bytes to align PTR on a @2: dbra d1, @1 ; longword boundary. ; Clear out as many registers ; as possible to use with MOVEM. move.l d2, d1 ; Don't need d1 for scratch. move.l d2, d3 move.l d2, d4 cmpi.l #TotalSize, COUNT bls @6 ; If it is a smaller block, then ; jump to the section that clears ; 16 at a time. ; Otherwise we'll need to clear ; out the rest of these registers ; to use in a huge MOVEM. move.l d2, d5 move.l d2, d6 move.l d2, d7 movea.l d2, a1 movea.l a1, a2 movea.l a1, a3 movea.l a1, a4 movea.l a1, a5 ; Now zero as much as we can ; in blocks of 144 (3 movem's ; of 48 bytes each). bra @4 @3: Loop ; Add more Loops here if you Loop ; like, but adjust the #defines Loop ; above. @4: subi.l #TotalSize, COUNT bge @3 ; The remaining block is now ; less than TotalSize, so now ; zero as much as we can ; in blocks of 16. addi.l #TotalSize, COUNT bra @6 @5: movem.l d1-d4, -(PTR) @6: subi.l #16, COUNT bge @5 ; The remaining block is now ; less than 16, so now ; use byte-zeroing. addi.l #16, COUNT bra @8 @7: move.b d2, -(PTR) @8: dbra COUNT, @7 ; Restore nuked registers. movem.l (sp)+,d0-d7/a0-a5 } } /* BlockZero */ /* eof BlockZero.c */ - -- ______________________________________________________________________________ mike gleason mgleason@cse.unl.edu NCEMRSoft, baby! +++++++++++++++++++++++++++ From: d88-jwa@musta.nada.kth.se (Jon Wtte) Date: 18 Sep 92 09:59:59 GMT Organization: Royal Institute of Technology, Stockholm, Sweden > mgleason@cse.unl.edu (Mike Gleason) writes: I tried this for the main block, but if I recall, dbra's loop counter was acting like a 16-bit word instead of a nice big longword. Well, using the SWAP function you can use DBRA as a 32-bit counter. Still beats the MS-DOS programming model :-) - -- Jon W{tte, h+@nada.kth.se, Sweden, Phone +46-8-107069 Help eradicate FIDO-Net <-> Usenet gateways in our time! +++++++++++++++++++++++++++ From: Bruce.Hoult@bbs.actrix.gen.nz Date: 18 Sep 92 07:55:21 GMT Organization: Actrix Information Exchange In article <1992Sep17.182911.27693@unlinfo.unl.edu> mgleason@cse.unl.edu (Mike Gleason) writes: > Bruce.Hoult@bbs.actrix.gen.nz writes: > > >You should use dbra for the loop control (only two bytes longer than your > >version and much faster)... > > I tried this for the main block, but if I recall, dbra's loop counter > was acting like a 16-bit word instead of a nice big longword. Correct -- that's why the code I showed used *two* DBRA's in nested loops, with the tricky little SWAPs to get the high word of the count register into the low word for the outer DBRA and back again. It's a little bit ugly but correctly handles a 32-bit count, is only 2 bytes bigger than what you were doing, and the extra SWAPs only get executed once every 65536 loops so their overhead is almost zero. All of which make this method (thanks to Lawrence for originally pointing the method out!) overall tidier than the obvious nested DBRA method of putting 16-bit counts in two different registers -- which has a little less overhead every 65536 loops, but uses at minimum an extra MOVE.L and SWAP in the loop prologue and uses an extra precious register, for the same size code. Isn't it amazing how much fun saving a couple of bytes or cycles can be? ;-) - -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 477 2116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ "Cray's producing a 500 MIPS personal computer with 256MB RAM and 8 GB hard disk that fits in your pocket!" "Great! Is it PC compatible?" +++++++++++++++++++++++++++ From: wombat@claris.com (Scott Lindsey) Date: 18 Sep 92 09:18:30 GMT Organization: Claris Corporation, Vancouver WA In article <keith-160992104732@kip-58.taligent.com>, keith@taligent.com (Keith Rollin) writes: > > On the other hand, I don't > know what the timing differences between MOVE.L and CLR.L are, so perhaps a Well, according to the 5th edition of the M68000 Programmer's Reference Manual from Motorola, MOVE.L #<data>,(An)+ 20(3/2) CLR.L (An)+ 24(2/4) + 16(4/0) where the first number is clock periods, and the 2 parenthesized are bus read & write cycles. the 16(4/0) is the time required for the (An)+ effective address calculation. So on a 68000, the MOVE beats the CLR. The manual also includes timings for the 68010. The MOVE timing is same, but the CLR timing is 12(1/2) for (An)+ with no additional time for address calculation. So CLR is the winner. I don't have timings for 020, 030, 040's, but one would expect them to be more like the 010 than the 000. Note that the 010 also has a "loop mode" where opcode fetches are suppressed, speeding up the loop even more. So I would guess that if you're optimizing for the 68000 (which needs it the most) then use MOVE, but use CLR if you're optimizing for > 68000. - -- Scott Lindsey <wombat@claris.com> QUIT Scott Lindsey <wombat@claris.com> +++++++++++++++++++++++++++ From: peterc@cubetech.com (Peter Creath) Date: Thu, 17 Sep 92 22:25:37 CDT Organization: Cube Technologies Here's my optimization on Mike Gleason's BlockZero.c snippet: (I'm gonna remove the comments within the code and explain it here...) This is shorter, primarily because the code itself (not the declarations) is entirely in assembly. Therefore, it can be used by any Pascal or C compiler with an inline assembler. This is also slightly faster and smaller, since I killed possible inefficiencies in the compiled code. The original routine was designed for speed, not for size, so I added an option that compiles a MUCH smaller (but slower) version: #define FAST /* leave undefined for the SMALL version */ #define ZERO(a) BlockZero(&(a), sizeof (a)) { HParamBlockRec pb; ZERO(pb); } void BlockZero(void *area, register unsigned long n); void BlockZero(void *area, register unsigned long n) { register char *p = (char *)area; register unsigned long temp; register unsigned long zero; #ifdef FAST asm { tst.l n beq.s @skipit /* only if len > 0 */ move.w p,temp andi.w #0x01,temp /* if odd offset, */ beq.s @skipeven clr.b (p)+ /* clear odd byte and */ subq.l #0x01,n /* make offset even */ @skipeven: move.l n,temp asr.l #0x02,temp /* len/4 (# of longs in area) */ bra.s @fourloop @clrloop: /* clear out longs speedily */ clr.l (p)+ @fourloop: dbf temp,@clrloop @nofours: move.b n,temp andi.b #0x03,temp /* check remainder (len % 4) */ bra.s @oddloop @clearodd: clr.b (p)+ /* clear remaining bytes */ @oddloop: dbf temp, @clearodd @skipit: } #else /* if SMALL */ asm { move.l n,temp bra.s @loopback @loop: clr.b (p)+ @loopback: dbf temp,@loop } #endif } /* BlockZero */ /* eof BlockZero.c */ - ---------------------------------------------------------------------------- Peter Creath "When I was a boy I was told that anybody could peterc@cubetech.com become president; I'm beginning to believe it." -- Clarence Darrow +++++++++++++++++++++++++++ From: mgleason@cse.unl.edu (Mike Gleason) Date: 21 Sep 92 03:10:42 GMT Organization: NCEMRSoft peterc@cubetech.com (Peter Creath) writes: >Lines: 75 >Here's my optimization on Mike Gleason's [original] BlockZero.c snippet: >(I'm gonna remove the comments within the code and explain it here...) >This is shorter, primarily because the code itself (not the declarations) >is entirely in assembly. Therefore, it can be used by any Pascal >or C compiler with an inline assembler. >This is also slightly faster and smaller, since I killed possible >inefficiencies in the compiled code. The original routine was designed >for speed, not for size, so I added an option that compiles a MUCH >smaller (but slower) version: Thanks for the improvements; unfortunately they were made to the first version. The second one I posted is over 3 times faster. It compiles to about 146 bytes (as opposed to 84 bytes for version 1), so you may want to re-optimize it for size if the extra space is bothersome. - --mg p.s., here's the benchmarks of my functions. Please don't make fun of my slow 7.x MHz CPU :-) Function 0 is Symantec's memset; 1 is my first BlockZero; 2 is the new BlockZero. 3 is Michael Hecht's wzbzeri; Function 0: 7 bytes in 89 ticks 59.91 KB/sec Function 1: 7 bytes in 112 ticks 47.61 KB/sec Function 2: 7 bytes in 162 ticks 32.91 KB/sec Function 3: 7 bytes in 170 ticks 31.36 KB/sec Function 0: 99 bytes in 464 ticks 162.52 KB/sec Function 1: 99 bytes in 216 ticks 349.12 KB/sec Function 2: 99 bytes in 194 ticks 388.71 KB/sec Function 3: 99 bytes in 261 ticks 288.93 KB/sec Function 0: 335 bytes in 1423 ticks 179.32 KB/sec Function 1: 335 bytes in 495 ticks 515.51 KB/sec Function 2: 335 bytes in 204 ticks 1250.86 KB/sec Function 3: 335 bytes in 460 ticks 554.73 KB/sec Function 0: 2048 bytes in 8154 ticks 191.32 KB/sec Function 1: 2048 bytes in 2499 ticks 624.25 KB/sec Function 2: 2048 bytes in 910 ticks 1714.29 KB/sec Function 3: 2048 bytes in 1291 ticks 1208.37 KB/sec Function 0: 308294 bytes in 1243151 ticks 188.90 KB/sec Function 1: 308294 bytes in 358464 ticks 655.11 KB/sec Function 2: 308294 bytes in 114221 ticks 2055.96 KB/sec Function 3: 308294 bytes in 170072 ticks 1380.79 KB/sec - -- - --mg mgleason@cse.unl.edu +++++++++++++++++++++++++++ From: keith@taligent.com (Keith Rollin) Date: 21 Sep 92 18:17:57 GMT Organization: Taligent In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu (Mike Gleason) wrote: > > p.s., here's the benchmarks of my functions. Please don't make fun of my > slow 7.x MHz CPU :-) > > Function 0 is Symantec's memset; > 1 is my first BlockZero; > 2 is the new BlockZero. > 3 is Michael Hecht's wzbzeri; > > Function 2: 7 bytes in 162 ticks 32.91 KB/sec > Function 2: 99 bytes in 194 ticks 388.71 KB/sec > Function 2: 335 bytes in 204 ticks 1250.86 KB/sec > I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU. NEARLY 3 SECONDS TO CLEAR 7 BYTES??? Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only 20% longer to clear more than 12 times that much? Or 194 units to clear 99 bytes and only 10 units more to clear over 3 times that much? The overhead's gotta be killer. And the math doesn't work... (7 bytes / 162 ticks) * (60 ticks / 1 second) comes to .00259 KB/second. You'd have to assume Mike ran 12,693 iterations in order to get the tickcount he did on that first test. - ----- Keith Rollin Phantom Programmer Taligent, Inc. +++++++++++++++++++++++++++ From: mgleason@cse.unl.edu (Mike Gleason) Date: 21 Sep 92 22:36:09 GMT Organization: NCEMRSoft keith@taligent.com (Keith Rollin) writes: >In article <1992Sep21.031042.22722@unlinfo.unl.edu>, mgleason@cse.unl.edu >(Mike Gleason) wrote: >> >> p.s., here's the benchmarks of my functions. Please don't make fun of my >> slow 7.x MHz CPU :-) >> >> Function 0 is Symantec's memset; >> 1 is my first BlockZero; >> 2 is the new BlockZero. >> 3 is Michael Hecht's wzbzeri; >> >> Function 2: 7 bytes in 162 ticks 32.91 KB/sec >> Function 2: 99 bytes in 194 ticks 388.71 KB/sec >> Function 2: 335 bytes in 204 ticks 1250.86 KB/sec >> >I'm sorry, but I gotta make fun of Mike's slow 7.x MHz CPU. >NEARLY 3 SECONDS TO CLEAR 7 BYTES??? :-) I evidently forgot to mention that I was using VIA_ticks() that comes with Symantec's ANSI library. According to them there are "approximately 780,000" per second. Originally my test suite was using system Ticks, but 60 units per second isn't enough so I decided to try using the VIA timer that Symantec's profiler uses. >Also, what's the deal with it taking 162 "units" to clear 7 bytes, and only >20% longer to clear more than 12 times that much? Or 194 units to clear 99 >bytes and only 10 units more to clear over 3 times that much? The >overhead's gotta be killer. There is a bit of overhead. Here's a bit of pseudo-code to illustrate it: (Did the second version of the code get through? You apparently haven't seen it and I have had an email request. I guess I will just append it again at the end of this message; sorry if this is wasting bandwidth.) make a copy of 'n' for the counter; make a copy of the pointer to the area to be zeroed; align our ptr on a longword boundary; if (n >= 144) then zero 144 at a time, subtract amount zeroed from n; if (n >= 16) then zero 16 a time, subtract amount zeroed from n; if (n >= 1) then zero 1 at a time. If you look at the results for Symantec's memset(), which is just one byte at a time, it doesn't do much better for small sizes, despite having little overhead. It just looks like mine has a lot of overhead because it does so much better on big blocks. Function 0: (memset) 7 bytes in 89 ticks 59.91 KB/sec Function 0: 99 bytes in 464 ticks 162.52 KB/sec Function 0: 335 bytes in 1423 ticks 179.32 KB/sec - --mg (Version 2 of BlockZero follows...) /* BlockZero.c */ /* Handy macro: */ #define ZERO(block,size) BlockZero(&(block), (long) (size)) void BlockZero(void *area, long n); #define PTR a0 /* Our copy of 'area.' */ #define COUNT d0 /* Our copy of 'n.' */ /* * Using registers d1 - d7, and a1 - a5 to hold zeroes for use * with MOVEM. Could use a6 also if you can trick the inline * assembler to not use LINK/UNLK. */ #define nAddrRegsUsed 5 #define nDataRegsUsed 7 /* * One MOVEM instruction will clear this many bytes at a time. */ #define ZSize ((nAddrRegsUsed + nDataRegsUsed) * sizeof (long)) /* * This is how many MOVEM instructions you want in your loop. For each * extra one you add, then large blocks will clear faster; however * smaller blocks will be slower because it will use the intermediate * loop, where only 16 bytes are cleared at a time. */ #define UnrolledLoops 3 /* * For each UnrolledLoop, do one of these. */ #define Loop movem.l d1-d7/a1-a5, -(PTR) /* * This is how many bytes are cleared by all of your MOVEMs before * we have to decrement the count and start another loop iteration. */ #define TotalSize (ZSize * UnrolledLoops) void BlockZero(void *area, long n) { asm { movem.l d0-d7/a0-a5, -(sp) move.l n, COUNT ; Could use index from the SP movea.l area, PTR ; for these instead of a6. add.l COUNT, PTR ; Start at end of block, ; and go backwards. move.w PTR, d1 ; Find (area + n) % 4. andi.w #3, d1 sub.l d1, COUNT ; We will byte-zero this ; many bytes to longword align, ; so subtract this from the ; total. moveq.l #0, d2 bra @2 ; Zero the last 3 or so @1: move.b d2, -(PTR) ; bytes to align PTR on a @2: dbra d1, @1 ; longword boundary. ; Clear out as many registers ; as possible to use with MOVEM. move.l d2, d1 ; Don't need d1 for scratch. move.l d2, d3 move.l d2, d4 cmpi.l #TotalSize, COUNT bls @6 ; If it is a smaller block, then ; jump to the section that clears ; 16 at a time. ; Otherwise we'll need to clear ; out the rest of these registers ; to use in a huge MOVEM. move.l d2, d5 move.l d2, d6 move.l d2, d7 movea.l d2, a1 movea.l a1, a2 movea.l a1, a3 movea.l a1, a4 movea.l a1, a5 ; Now zero as much as we can ; in blocks of 144 (3 movem's ; of 48 bytes each). bra @4 @3: Loop ; Add more Loops here if you Loop ; like, but adjust the #defines Loop ; above. @4: subi.l #TotalSize, COUNT bge @3 ; The remaining block is now ; less than TotalSize, so now ; zero as much as we can ; in blocks of 16. addi.l #TotalSize, COUNT bra @6 @5: movem.l d1-d4, -(PTR) @6: subi.l #16, COUNT bge @5 ; The remaining block is now ; less than 16, so now ; use byte-zeroing. addi.l #16, COUNT bra @8 @7: move.b d2, -(PTR) @8: dbra COUNT, @7 ; Restore nuked registers. movem.l (sp)+,d0-d7/a0-a5 } } /* BlockZero */ /* eof BlockZero.c */ - -- /* * Mike Gleason, NCEMRSoft Worldwide Development. mgleason@cse.unl.edu * Patriots fan, and damn proud of it! "Bienvenido a Macintosh" --7.0e1 */ +++++++++++++++++++++++++++ From: wombat@claris.com (Scott Lindsey) Date: 21 Sep 92 23:13:23 GMT Organization: Claris Corporation, Vancouver WA In article <D88-JWA.92Sep18144437@dront.nada.kth.se>, d88-jwa@dront.nada.kth.se (Jon Wtte) writes: > > > How about MOVEQ.L? > On the 68000, MOVEQ can only move 8 bits into a data register; the 8 bits are sign-extended into a long. MOVEQ's timing is 8(2/0); 4(1/0) on an '010. - -- Scott Lindsey <wombat@claris.com> +++++++++++++++++++++++++++ From: ldo@waikato.ac.nz (Lawrence D'Oliveiro, Waikato University) Date: 23 Sep 92 06:08:56 GMT Organization: University of Waikato, Hamilton, New Zealand In article <1992Sep21.223609.4608@unlinfo.unl.edu>, mgleason@cse.unl.edu (Mike Gleason) writes: > :-) I evidently forgot to mention that I was using VIA_ticks() that comes > with Symantec's ANSI library. According to them there are "approximately > 780,000" per second. Originally my test suite was using system Ticks, but > 60 units per second isn't enough so I decided to try using the VIA timer > that Symantec's profiler uses. Why not just use the Time Manager? You can get microsecond accuracy (none of this "approximately 780,000" business). Details in IM6. Lawrence D'Oliveiro fone: +64-7-856-2889 Computer Services Dept fax: +64-7-838-4066 University of Waikato electric mail: ldo@waikato.ac.nz Hamilton, New Zealand 37^ 47' 26" S, 175^ 19' 7" E, GMT+12:00 --------------------------- End of C.S.M.P. Digest **********************